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Abstract 

Computation of the marginal likelihood from a simulated posterior distribution is central 
to Bayesian model selection but is computationally difficult. The often-used harmonic mean 
approximation uses the posterior directly but is unstably sensitive to samples with anomalously 
small values of the likelihood. The Laplace approximation is stable but makes strong, and of- 
ten inappropriate, assumptions about the shape of the posterior distribution. It is useful, but 
not general. We need algorithms that apply to general distributions, like the harmonic mean 
approximation, but do not suffer from convergence and instability issues. Here, I argue that the 
marginal likelihood can be reliably computed from a posterior sample by careful attention to 
the numerics of the probability integral. Posing the expression for the marginal likelihood as a 
Lebesgue integral, we may convert the harmonic mean approximation from a sample statistic 
to a quadrature rule. As a quadrature, the harmonic mean approximation suffers from enor- 
mous truncation error as consequence . This error is a direct consequence of poor coverage 
of the sample space; the posterior sample required for accurate computation of the marginal 
likelihood is much larger than that required to characterize the posterior distribution when us- 
ing the harmonic mean approximation. In addition, I demonstrate that the integral expression 
for the harmonic-mean approximation converges slowly at best for high-dimensional problems 
with uninformative prior distributions. These observations lead to two computationally-modest 
families of quadrature algorithms that use the full generality sample posterior but without the 
instability. The first algorithm automatically eliminates the part of the sample that contributes 
large truncation error. The second algorithm uses the posterior sample to assign probability 
to a partition of the sample space and performs the marginal likelihood integral directly. This 
eliminates convergence issues. The first algorithm is analogous to standard quadrature but can 
only be applied for convergent problems. The second is a hybrid of cubature: it uses the poste- 
rior to discover and tessellate the subset of that sample space was explored and uses quantiles to 
compute a representative field value. Qualitatively, the first algorithm improves the harmonic 
mean approximation using numerical analysis, and the second algorithm is an adaptive version 
of the Laplace approximation. Neither algorithm makes strong assumptions about the shape 
of the posterior distribution and neither is sensitive to outliers. Based on numerical tests, we 
recommend a combined application of both algorithms as consistency check to achieve a reliable 
estimate of the marginal likelihood from a simulated posterior distribution. 

Keywords: Bayesian computation, marginal likelihood, algorithm, Bayes factors, model selec- 
tion 



* E-mail: weinberg@astro.umass.edu 



1 



1 Introduction 



A Bayesian data analysis specifies joint probability distributions to describe the relationship be- 
tween the prior information, the model or hypotheses, and the data. Using Bayes theorem, the 
posterior distribution is uniquely determined from the conditional probability distribution of the 
unknowns given the observed data. The posterior probability is usually stated as follows: 

P(WD)= «M (!) 

where 

Z = P(p\M) = J d0v(O\M)L(p\O,M) (2) 

is the marginal likelihood. The symbol Ai denotes the assumption of a particular model and the 
parameter vector 9 £ f2. For physical models, the sample space is most often a continuous space. 
In words, equation ([TJ) says: the probability of the model parameters given the data and the model 
is proportional to the prior probability of the model parameters and the probability of the data 
given the model parameters. The posterior may be used, for example, to infer the distribution 
of model parameters or to discriminate between competing hypotheses or models. The latter is 
particularly valuable given the wide variety of astronomical problems wh e re diy erse hypotheses 
describing heterogeneous physical systems is the norm (see Gelman et al. . 20031 . for a thorough 
discussion of Bayesian data analysis). 

For parameter estimation, one often considers P(D\A4) to be an uninteresting normalization 
constant. However, equation ([2]) clearly admits a meaningful interpretation: it is the support or 
evidence for a model given the data. This see this, assume that the prior probability of some 
model, Aij say, is P(Aij). Then by Bayes theorem, the probability of the model given the data is 
P{Mj\D) = P(Mj)P(p\Mj)fP(p). The posterior odds of Model j = relative to Model j = 1 
is then: 

P(M \T>) = P(Mq)P(D\Mq) 
P(Mi\B) P(Mi) P(T>\Mi)' 

If we have information about the ratio of prior odds, P(Mo)/P(Mi), we should use it, but more 
often than not our lack of knowledge forces a choice of P(Mq) / P(A4\) = 1. Then, we esti- 
mate the relative probabi lity of the models giy e n D over their prior odds by the Bayes factor 



P(D\Mq) / P(D\Mi) (see lLavine and Schervishl . Il999l . for a discussion of additional concerns). 



When there is no ambiguity, we will omit the explicit dependence on Ai of the prior distribution 
likelihood function, and marginal likelihood for notational convenience. 



The Bayes factor has a number of attractive advantages for model selection (jKass and Rafterv . 



19951 ): (1) it is a consistent selector; that is, the ratio will increasingly favor the true model in 
the limit of large data; (2) Bayes factors act as an Occam's razor, preferring the simpler model 
if the "fits" are similar; and (3) Bayes factors do not require the models to be nested in any 
way; that is, the models and their parameters need not be equivalent in any limit. There is a 
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catch: direct computation of the marginal likelihood (eq. [2]) is intractable for most problems of 
interest. However, recent advances in computing technology together with developments in Markov 
chain Monte Carlo (MCMC) algorithms have the promise to compute the posterior distribution 
for problems that have been previously infeasible owing to dimensionality or complexity. The 
posterior distribution is central to Bayesian inference: it summarizes all of our knowledge about 
the parameters of our model and is the basis for all subsequent inference and prediction for a given 
problem. For example, current astronomical datasets are very large, the proposed models may 
be high-dimensional, and therefore, the posterior sample is expensive to compute. However, once 
obtained, the posterior sample may be exploited for a wide va r iety o f tasks. Although dimension- 
switching algorithms, such as reversible-jump MCMC (|Greenl . Il995l ) incorporate model selection 
automatically without need for Bayes factors, these simulations appear slow to converge for some of 
our real-world applications. Moreover, the marginal likelihood may be used for an endless variety 
of tests, ex post facto. 



Newton and Raftervl ()1994l ) presented a formula for estimating Z from a posterior distribution 
of parameters. They noted that a MCMC simulation of the posterior selects values of € f2 
distributed as 



Z x P(0|D) = L(D|0)vr(0) 



and, therefore, 



Z x 



P(0|D) 



or 



1 

~Z 



P(0\T>) 



dOir(O) = 1 



n d6 L(V\0) 



E 



L(0|D) 



F(0|D) 



(3) 



(4) 



having suppressed the explicit dependence on Ai for notational clarity. This latter equation says 
that the marginal likelihood is the harmonic mean of the likelihood with respect to the posterior 
distribution. It follows that the harmonic mean computed from a sampled posterior distribution is 
an estimator for the marginal likelihood, e.g.: 



Z 



1 N 

-Y 



1 



(5) 



Unfortunately, this estima tor is prone to dom ination by a few outlyin g terms with a bnormally 



small values of Lj (e.g. see Rafterv et al. . 2007, and references the rein). Wolpert ( 20021 ) describ 



convergence criteria for equation (0) and lChib and Jeliazkovl (|200ll ) present augmented approaches 
with error estimates. 

Alternative approaches to co mputing the marginal lik elihood from the posterior distribution 
have been described at length by iKass and Rafterv (1 19951 ). Of these, the Laplace approximation, 
which approximates the posterior distribution by a multidimensional Gaussian distribution and uses 
this approximation to compute equation ([2]) directly, is the most widely used. This seems to be 
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favored over equation (|29p because of the problem with outliers and hence because of convergence 
and stability. In many cases, however, the Laplace approximation is far from adequate in two ways. 
First, one must identify all the dominant modes, and second, modes may not be well-represented 
by a multidimensional Gaussian distribution for p roblems of practical i nterest, altho ugh many 



promising improvements have been sugg ested (e.g. IPiCicio et all . Il997h . iTrottal hoofl) explored 



the use of the Savage-Dickey density ratio for cosmological model selection (see also Trotta . 20081 . 
for a full review of the model selection problem for cosmology). 

Finally, we may consider evaluation of equation ([2]) directly. The MCMC simulation samples 
the posterior distribution by design, and therefore, can be used to construct volume elements in k- 
dimensional parameter space, dO, e.g. when Q C M. k . Although the volume will be sparsely sampled 
in regions of relatively low likelihood, these same volumes will make little contribution to equation 
([2]). The often-used approach from computational geometry, Delaney triangulation, maximizes 
the minimum angle of the facets and thereby yields the "roundest" volumes. Unfortunately, the 
standard procedure scales as 0(kN 2 ) for a sample of N poi nts. This can be reduced t o O [N log N+ 
N k l 2 ) using the flip algorithm with iterative construction ( Edelsbrunner and Shah . 1966) but this 
scaling is prohibitive for large ./V and k typical of many problems. Rather, in this paper, we consider 
the less optimal but tractable kd-tree for space partitioning. 

In part, the difficulty in computing the marginal likelihood from the sampled posterior has 
recently led ISkillind (|2006l . "nesting sampling" ) to suggest an algorithm to simulate the marginal 
likelihood rather than the posterior distributi o n. This idea has been a dopted and extended by 



cosmological modelers ( Mukherjee et al. . 20061 ; Feroz and Hobson . 2008). The core idea of nest 



ing sampling follows by rewriting equation 
integration, e.g. 



as a double integral and swapping the order of 



Z 



L(D|0) 



dy 



su P {L(r>\ey.een} 



dy 



L(D\e)> y 



d6ir(6). 



(6) 



The nested sampler is a Monte Carlo sampler for the likelihood function L with respect to the prior 
distrbution ir so that L > y. The generalization of the construction in equation ([6]) for general 
distributions and multiple dimensions is the Lebesgue integral (see Clearly, this procedure has 
no problems with outliers with small values of L(D\0). Of course, any algorithm implementing 
nested sampling must still thoroughly sample the multidimensional posterior distribution and so 
retains all of the intendant difficulties that MCMC has been designed to solve. 

In many ways, the derivation of the nested sampler bears a strong resemblance to the derivation 
of the harmonic mean but without any obvious numerical difficulty. This led me to a careful study of 
equations (|2|) and ([3]) to see if the divergence for small value of likelihood could be addressed. Indeed 
they can, and the following sections describe two algorithms based on each of these equations. 
These new algorithms retain the main advantage of the harmonic mean approximation (HMA): 
direct incorporation of the sampled posterior without any assumption of a functional form. In this 
sense, they are fully and automatically adaptive to any degree multimodality given a sufficiently 
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Figure 1: Illustration of the integral J^dxf(x) using Riemann and Lebesgue integration. For 
Riemann integration, we sum thin vertical rectangles of height f{X) about the abscissa point X 
for some width dx. For Lebesgue integration, we sum thin horizontal rectangles of width M(Y) 
about the ordinate point Y of height dy. In both cases, we sum the area under the curve f(x). In 
the Lebesgue case, we must add in the rectangle of width M(yo) and height y^. 



large sample. We begin in $2] with a background discussion of Lebesgue integration applied to 
probability distributions and Monte Carlo (MC) estimates. We apply this in to the marginal 
likelihood computation. This development both illuminates the arbitrariness in the HMA from the 
numerical standpoint and leads to an improved approach outlined in $H In short, the proposed 
approach is motivated by methods of numerical quadrature rather than sample statistics. Examples 
in £0 compare the application of the new algorithms to the HMA and the Laplace approximation. 
The overall results are discussed and summarized in 



2 Integration on a random mesh 

Assume that we have a MC generated sample of random variates with density f(x). Recall that any 
moment or expectation with respect to this density may be computed as a single sum, independent 
of the parameter-space dimension! This powerful and seemingly in nocuous property follows from 



the power of Lebesgue integration (e.g. ICapinski and Koppl . 120071 ). To see this, let us begin by 
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considering a one-dimensional integral 

/= f ' f(x)dx (7) 

J a 

where f(x) is non-negative, finite and bounded, that is: < yo < fix) < yN for x £ [a, b]. More 
formally, we may define the Lebesgue integral / over the measure of sets of points x £ £1 = \a, b] 



with measure /U as follows. We assume that f(x) is measurable over f2 and, following iTemple 



(Il97ll . §8 .3), define the measure function M(y) = fi(x\f(x) > y). Clearly, M{y) is monotonic with 



= M(y N ) < M(y) < M(y ) and [i(fl) = b - a. Let S : a < x < x\ < ■ ■ ■ < x N < b be a 
partition of [a, b] with liniAr^oo xq = a and limjv-i-oo x^ = b. In our case, the partition S is our MC 
sample. The choice of S induces a partition of y through yj = f(xj) although the sequence in y 
will no longer be monotonic. For convenience, reorder the y such that y%-\ < y%. Now consider the 
Lebesgue integral of f(x) over f2, 

/= [ b f(x)dx= [ M(y)dy. (8) 

Ja Jf-i( y )en 

We interpret this geometrically as the area under the curve x = f (y)', in other words, we have 
swapped the abscissa and ordinate. To see this, define 

N N 

£ s =^M( 2 / i _i)(y i -y i _ 1 ) and Us = J2 ~ Vi-i) ( 9 ) 

1=1 i=l 

for the partition 5. Define the interval Aj = yi — y%-\. Clearly 

N 

Us-C S <Y, WiVi) - M(ik_i)] sup(A i ) = [M(b) - M(o)] sup(A,) (10) 
i=i 

and, therefore, lim7v-s-oo(^5 — £s) ~^ since M(y) is monotonic and liniAr_ s . 0O Aj — > 0. Using this, 
we may evaluate the integral in equation ([8]) as follows: 

rVN 

I = M (y )y + lim C s = M (y )y + Km U s = M(y )y + / M(y) dy. (11) 

N^oo N^oo J yo 

The sums in equation ([9]) have the form of a rectangular quadrature of the measure over the range of 
f(x). This geometry is illustrated in Figure [TJ Although the Lebesgue integration theory is general, 
the equivalence of the one-dimensional integral and equation (jlip is easily seen by rewriting equation 
([7]) as a two dimensional integral and changing the order of integration using elementary techniques 
as follows: 

ft> fb ( ff(x) "| ffmax 

1= / f(x)dx= / <^ / dy\dx = M(y )yo+ / M(y)dy 

J a Ja [JO I J fmin 
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where f m in an d fmax are the minimum and maximum values of f{x) in [a, 6]. 
An average of the sums in equation ([9]) gives us a trapezoidal rule analog: 



N 



Ts = \j2 t M (y-i) + M (y^ (yi - vi-i)- ( 12 ) 

i=l 

Further generalization is supported by the Lebesgue theory of differentiation. A central result of the 
theory is that a continuous, m onotonic function in some interval is differentiable almost everywhere 
in the interval ( Temple . 197ll ). This applies to the measure function M{Y). This result may be 



intuitively motivated in the context of our marginal likelihood calculation as follows. Our measure 
function describes the amount of density with likelihood smaller than some particular value. A 
typical likelihood surface for a physical model is smooth, continuous, and typically consists of 
several discrete modes. Consider constructing M(Y) by increasing the value of Y beginning at the 
point of maximum likelihood peak. Since Y = this is equivalent to beginning at max(L) = Y^ 1 
and decreasing L. Recall that M(Y) decreases from 1 to as Y increases from Yq to oo. Therefore, 
we construct M(Y) from M(Y + AY) by finding the level set corresponding to some value of Y and 
subtracting off the area of the likelihood surface constructed from the perimeter of the set times 
AY". The function M(Y) will decrease smoothly from unity at Yq until Y reaches the peak of the 
second mode. At this point, there may be a discontinuity in the derivative of M(Y) as another 
piece of perimeter joins the level set, but it will be smooth thereafter until the peak of the third 
mode is reached, and so on. Since we expect the contribution to Z to be dominated by a few 
primary modes, this suggests that we can evaluate the integral I numerically using the quadrature 
implied by equation (jlip and possibly even higher-order quadrature rules. These arguments further 
suggest that partitioning Q into separate domains supporting individual modes would improve the 
numerics by removing discontinuities in the derivative of M(Y) and explicitly permitting the use of 
higher-order quadrature rules. This partition may be difficult to construct automatically, however. 

To better control the truncation error for this quadrature, we might like to choose a uniform 
partition in y, A = Aj, to evaluate the sums Cm and IAn- For MC integration, this is not possible. 
Rather, MC selects S with irregular spacings and this induces a partition of y. Motivated by kernel 
density estimation, we may then approximate M{y) by 

1 N 

M(y) = ^J2 @ j(y) ( 13 ) 

3=0 

where ©(•) monotonically increases from to 1 in the vicinity of f(xj). For example, we may 
choose to be the Heaviside function 

e j (y) = l 1 V<f{Xl) ° r V - f{X ^ (14) 
otherwise 
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which assigns a "step" to the upper value of the range in y for each Xj. Alternatively, we may 
consider smoothing functions such as 



®i(y) = \ 



1 + erf 



(15) 



where erf(-) denotes the error function. Then, upon substituting equation (fT3j) into equation (fTTj) 
for Us, we get: 

N N 

I N = M(y )y + £ M{y j ^ 1 )(y j - y^x) = £ (J,jf(xj) (16) 
i=i j=i 

where fij = M(yj_i) — M(yj) and M{y^) = by construction and the final equality follows by 
gathering common factors of yi = f(xi) in the earlier summation. 

For integration over probability distributions, we desire x distributed according some probability 
density g(x), 

/[/]= / f(x)g(x)dx, (17) 
Jr 

which yields 

M(y)= ( g{x)dx (18) 
Jf(x)>y 

with the normalization J g{x)dx = 1, and therefore, M(y) G [0, 1]. Despite the appearance of the 
density g in equation (|17p . only the value of measure has changed, not the formalism, i.e. the points 
Xj are now sampled from g(x) rather than uniformly. 

Let us now consider the connection between equation (|16p and a classic MC integration, although 
we do not need this for later sections of this paper. For a classic MC integration, we choose [ij = 
fi(Q)/N = constant by construction, and with this substitution / becomes the MC approximation: 

3=1 

For integration over probability distributions, we assign fj,j to its expectation value /ij = 1/N and 
the MC integral becomes 

1 - 

^ c = ^E/^-)> ( 19 ) 

3=1 

although intuition suggests that the explicit form from equation (I16p will yield a better result. 

The development leading to equation f)16|) remains nearly unchanged for a multidimensional 
integral: 

f(x)g(x)d k x. (20) 
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As in the one-dimensional case, the Lebesgue integral becomes 

/ = / f{x)g(x)d k x 

= M(y )y + lim C s = M(y )y + lim U s 

N^oo N—>oo 

rVN 

= M(y )y + / M{y)dy. (21) 
Jya 

where the only difference is that M(y) is now the measure of the set of points with y < f(x) with 
x £ M k . Perhaps more remarkable than the similarity of equation (jlip with equation (|2ip is that 
the numerical Lebesgue integral is one-dimensional independent of the dimensionality k. However, 
this does not simplify the computational work; one still needs to sample a subset of M. k to evaluate 
equation (f2Tj) . The MC version proceeds similarly as well: replace Xj by Xj in equation (fT9|) . 

In summary, Monte Carlo integration is most often posed as the expectation over a distribu- 
tion, which, more generally, is a Lebesgue integral. Lebesgue integration and differentiation theory 
suggests alternative computational approaches akin to traditional Riemann-based numerical anal- 
ysis, if the underlying likelihood function and prior probability density are well-behaved functions. 
We will see in the next section that a truncation-error criterion applied to the marginal likelihood 
integral in the form of equation (|2ip can improve the HMA. 



3 Application to the marginal likelihood integral 

Now, given a MC-computed sample from the posterior distribution P(0|D) with prior distribution 
7r(0) and likelihood function L(D\6), how does one compute the marginal likelihood? The integral 
in equation ([2]) states that marginal likelihood is the expectation of the likelihood with respect to 
the prior distribution. This is the same as equation ()2ip with 6 £ Q s C M. k replacing x, L(D\6) 
replacing f(x), ir(0) replacing g{x). Alternatively, returning to equation (J3]), the integral Z = -P(D) 
is implicitly defined by 

The value J is the probability of it over £l s . We will assume that £l s C O; this implies that J < 1 
since f n dOir(0) = 1. In addition, the existence of equation ()22[) implies that L(D\0) > almost 
everywhere in O. Defining Y = L , it follows that the Lebesgue integral of the integral on the 
left-hand-side of equation ([22]) is 



with measure 



K ~ i = J M{Y) dY + M(y ° )y ° (23) 



M(y) = / d6P(0\D). (24) 

iY(D\e)>y 
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Intuitively, one may interpret this construction as follows: divide up the parameter space 9 € 
f2 s C R fc into volume elements sufficiently small that P(0|D) is approximately constant. Then, 
sort these volume elements by their value of E(D|0) = L~ 1 (D\6). The probability element dM = 
M{Y + dY) — M(Y) is the prior probability of the volume between Y and Y + dY. 

Clearly M(Y) E [0, 1] and may be trivially computed from a MCMC-generated posterior dis- 
tribution. Using our finite MC-sampled distributed as the posterior probability, 9 ~ P(0|D), and 
converting the integral to a sum, we have the following simple estimate for Mi = M(Yi): 



N ^ 

3=1 3=1 



w' ' V 1, , M M_l V l r , M - M + M (2<i\ 



where we have defined the left and right end points from equation (|14j) and the mean separately 
so that m]^ < Mj < M"] w '. The indicator function 1{} enforces the inclusion of a contribution 1/N 
for index jf only if {Yj > Y{\ or {Yj > Yi] for the lower and upper form, respectively. Alternatively, 
these sums may be expressed using equations (fT3]) -([T6l). 

We may now estimate the marginal likelihood from equation (|22p using the second part of 
equation (121[) for finite N by gathering terms in Yi to get 

K = j M(Y)dY + M(Y )Y (26) 

i=0 U i=0 V + 7 U 

JV AT AT N 

= E - E 5* = E = 5- < 28 > 

1=1 1=1 1=1 j=l 

In deriving equation ([28]) . the leading term M(Yj)Yq from equation ([2T]) is absorbed into the sum 
and M(Yn) = 0. Assuming that J = 1 in equation ([22]) and using equation ([25]) yields 



Z = P(D) = J IK = \ — V — I 



(29) 



This is an alternative derivation for the "harmonic mean" approximation (HMA) to the marginal 
likelihood. 



3.1 Convergence of K 

The evaluation of the integral K (eq. l26l) may fail both due to insufficient sampling and intrinsic 
divergence. As an example of the former, a sparse sampling may lead to large intervals 5^+i — Yi 
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and inflated truncation error (eq. |27|) . We will consider this further in the next section. As an 
example of the latter, consider the textbook inference of an unknown mean 9 from a sample of N 
normally distributed points x ~ N{Q, cr|). The likelihood function is 



N 



L(D\9) = H 



-{Xi-Of/Ial 



(30) 



where Lq = sup{L : x G K} and x is the sample mean. Let the prior distribution for 6 be M(9q, <t|). 
We use an MCMC algorithm to sample the posterior distribution 9. 

Now, let us evaluate K using Lebesgue integration for this example. To begin, we need the 
measure function M(Y) = /i- 1 (£)|e)>y ^P with Y = L _1 We may use equation (f30|) to solve for 
9 = 9(L), noting that the solution has two branches. After some algebra, we have: 



mod = 1 - - 



ei X -y + A(Y)\_ er{ f X -y-A(Y) 
J 



(31) 



where 



V = 



a 2 9 /N + op 



<4/N + o* 



1 N 

2 ' 9 



Yn = L, 



-1 



A(Y) 



2rr 2 

-jflog(Y/Y ) 



The value y is the variance weighted mean of the prior mean and the sample mean, and a 2 is the 
harmonic mean of the variance of the prior distribution and the variance of the sample mean. The 
value Yq is the minimum value for Y and A(-) describes the offset of 9 with increasing Y. Note 
that A(Y ) = 0. 

Since the values of Y are obtained by a sampling, the sample will not cover \Yq, oo) but will be 
limited from above by the smallest sampled value of the likelihood Y max = L^ in . We define this 
limited value of the Integral K as 



K(Y n 



l: 



dYM{Y) + M(Y )Y 



Y 

1 y-,-, 



dYM(Y) + y 



(32) 



where the last equality uses M(Yq) = 1. Clearly K = K(oo) > K(Y max ). The magnitude of the 
truncation due to finite sampling, K (oo) — K (Y max ) , depends critically on the width of the likelihood 
distribution relative to the prior distribution. We describe this ratio of widths by b = a 2 /(Nag). 
The convergence condition limy_ s . 00 [i('(oo) — K(Y max )] — > requires that M(Y) decreases faster 
than y~ 1-<E for some e > 0. For 6 = and large Y, J Y dYM(Y) increases as log(logy). For b > 0, 
/ dYM(Y) deer eases at least as fast Y h . Figure [2] shows K(oo) — K(Yq) as a function of b and 
suggests that b > 0.1 is sufficient to obtain convergence for practical values of N. Qualitatively, 
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Figure 2: The integral K(Y) — K(Yq) is shown as a function of Y/Yq for various values of the ratio 
b = a^./(Nag). For an uninformative prior distribution ag S> a x /N and 6—^0 and -f^(Y) diverges 
with increasing Y/Yo- For an informative prior distribution 0(b) ~ 1 and K(Y) convergences 
quickly with Y/Yq. 

a prior distribution that limits Y from above (or, equivalently, L from below) will prevent the 
divergence. 

Similar asymptotic expressions for K(Y) may be derived for multivariate normal distributions. 
For simplicity, assume that data is distributed identically in each of k dimensions and, for ease of 
evaluation, take x = y. Then 



where T(a,x) is the upper incomplete gamma function and T(a) is the complete gamma function. 
Using the standard asymptotic formula for T(a,x) in the limit of large Y, one finds that 



This expression reduces to equation (|3ip when k = 1, but more importantly, this shows that the 
tail magnitude of M(Y) increases with dimension k. Figure [3] illustrates this for various values of 
k and b. 

The divergence of K in the limit — > oo shows that normally distributed likelihood function 
with an uninformative prior is divergent. Moreover, Figures [2] and [3] further demonstrate that 




(33) 





(34) 
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(a) b = 0.2 




5 10 15 20 25 

MY/Y„) 



(b) 6 = 0.6 

Figure 3: As in Figure [2j the integral K(Y) — K(Yq) is shown as a function of Y/Yq for the 
ratio b = 0.2, 0.6 for fc-dimensional normal data distributions. The integral K(Y) converges more 
rapidly with increasing b (as in Fig. [2]) but increasingly slowly with k. The run of K(Y) — K(Yq) 
is normalized to 1 at Y = oo to facilitate comparison. 
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weakly informative prior with very small b is likely to be numerically divergent even if K formally 
converges. Intuitively, the cause is clear: if the Markov chain never samples the wings of the prior 
distribution that still make significant contribution to K, then K will increase with sample size. 
Analytically, the failure is caused by the measure decreasing too slowly as Y increases (as described 
at the beginning of Empirically, the convergence of the integral K may be tested by examining 
the run of K(Y) for increasing Y. 

3.2 Application to the MCMC posterior distribution 

Overall, ^highlights the marginal likelihood as a numerical quadrature. We have considered the 
path of approximations leading from the quadrature to standard expression for the HMA. We have 
also considered the intrinsic requirements on the prior distribution so that the meausure M(Y) is 
convergent. This development suggests that there are two sources of error in evaluating equation 
(|22p using equations (|23H and (|2T[> . The first is a truncation error of order (Y^+i — Yi) 2 . The second 
is a bias error resulting from specifying the computational grid by a Monte Carlo procedure. We 
will consider each in turn. 

Thinking again in terms of numerical quadrature, we can trust the sum in equation (|27p when 
adjacent values of Lj are close. The error in K will be dominated by the extremely small values of 
Li that lead to hi = Yi + \ — Yi S> 1. Specifically, the error for such a term will be oc h 2 , and a small 
number of such terms can dominate the error for K. Numerical analysis is based on functional 
approximation of smooth, continuous, and differentiable functions. This truncation error estimate 
assumes that M(Y) is such a function. Although not true everywhere, we have argued in $2]) that 
this assumption should be valid over a countable number of intervals in practice. In addition, we 
expect the sample to be strongly clustered about the value of the posterior mode. By the Central 
Limit Theorem for a likelihood dominated posterior, the distribution of 9 tends toward a normal 
distribution. Therefore, larger N will yield more extremal values of L and increase the divergence 
of hi = — . Eventually, for proper prior distributions and sufficiently large N, the smallest 
possible value of L will be realized as long as L > (see eq. [22] and following discussion) . Further 
sampling will reduce the largest intervals, and this will lead to the decrease of large hi, and finally, 
convergence. 

The second source of error is closely related to the first. After eliminating the divergent samples 
with hi S> 1, the sampled domain £l s will be a subset of the originally defined domain Q s C £1. That 
is, the MCMC sample will not cover all possible values of the parameter vector 9. This implies that 
the numerical quadrature of equation (1221) will yield J < 1. Identification of these error sources 
immediately suggests solutions. Note that this observation does not change the problem definition 
in some new way, but rather, allows us to exploit the MCMC-chosen domain £l s to eliminate the 
divergence for small b described in §3.11 

First, we may decrease the truncation error in K by ordering the posterior sample by increasing 
values of Y and truncating the sequence at the point where hi > h* for some choice /t* <C 1. 
Next, we need a consistent evaluation for J. We may use the sampled posterior distribution itself 



14 



to estimate the sampled volume in f2 s C £1. This may be done straightforwardly using a space 
partitioning structure. A computationally efficient structure is a binary space partition (BSP) 
tree, which divides a region of parameter space into two subregions at each node. The most easily 
implemented tree of this type for arbitrary dimension is the kd-tree (short for k-dimensional tree). 
The computational complexity for building the tree from the iV sampled points in parameter space 
sca les as CHiVlog 2 N) u sing the Quicksort algorithm at each successive level (this may be improved, 
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Cormen et al.1 . 12001I ). Each leaf has zero volume. Each non-leaf node has the minimum volume 
enclosing the points in the node by coordinate planes. Assigning the volume containing a fixed 
number of leaves rh (e.g. m = 16 or 32), and some representative value of the prior probability in 
each node (such as a p-quantile or mean value) , one may immediately sum product of each volume 
and value to yield an estimate of J. For modest values N, we will almost certainly find that J < 1. 
Since the MCMC chain provides the values of tt(0) and P(Q) = ir(0)L(D\0), we may use the same 
tree to evaluate both Z and J over the sampled volume O s . 

The example in ^3. II suggests that evaluation of K may stymied by poor convergence unless the 
prior distribution is restrictive. Therefore, if the value of K is divergent or very slowly convergent, 
the evaluation of Z using K/J will fail whether or not we use the improved truncation criteria. 
Direct evaluation of the Z is free from this divergence and remains an practical option in this case. 
The advantage of a direct evaluation is clear: the converged Markov chain samples the domain f2 
proportional to the integrand of equation ([2]), and therefore, we expect 

lim / dOn(0)L{T)\e) » lim / d07r(0)L(D|0) -> 

i-+°°Jn, N ^°°Jn\n„ 
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for large sample size by construction. We propose a hybrid of cubature and Monte Carlo integration. 
The BSP tree provides a tiling of multidimensional volume by using the posterior distribution to 
define volume elements, AV. We use a p-quantile (such as the p = 0.5 median) or mean value 
of the posterior probability or the prior probability to assign a probability value to each volume 
element. An approximation to the integrals Z and J follow from summing the field values over the 
volume elements, analogous to a multidimensional Riemann rule. 

Although ir(0)L(D\6)AV = constant for a infinite posterior sample, there are several sources 
of error in practice. First, the variance in the tessellated parameter-space volume will increase 
with increasing volume and decreasing posterior probability. This variance may be estimated by 
bootstrap. Secondly, the truncation error of the cubature increases with the number of points per 
element. As usual, there is a variance-bias trade off choosing the resolution of the tiling: the bias of 
the probability value estimate increases and the variance decreases as the number of sample points 
per volume element increases. The prior probability value will be slowing varying over the posterior 
sample for a typical likelihood-dominated posterior distribution, so the bias will be small. This 
suggests that larger numbers of points per cell will be better for the evaluation of J and a smaller 
number will be better for Z. Some practical examples suggest that the resulting estimates are not 
strongly sensitive to the number of points per cell (fh = 16 or 32 appears to be a good compromise). 



15 



Almost certainly, there will be a bias toward larger volume and therefore larger values of Z and 
this bias will increase with dimension most likely. 

To summarize, we have described two approaches for numerically computing Z from a MCMC 
posterior simulation. The first evaluates of the integral K by numerical Lebesgue integration, and 
the second evaluates Z directly by a parameter space partition obtained from the sampled poste- 
rior distribution. The first is closely related to the HMA. It applies ideas of numerical analysis the 
integral that defines the HMA. The second is more closely related to the Laplace approximation. 
In some sense, Laplace approximation is an integral of a parametric fit to the posterior distribu- 
tion. The tree integration described above is, in essence, an integral of a non-parametric fit to 
the posterior distribution. The advantage of the first method its amenability to analysis. The 
disadvantage is the limitation on convergence as illustrated in §3.11 The advantage of the second 
method is its guaranteed convergence. The disadvantage is its clear, intrinsic bias and variance. 
The variance could be decreased, presumably, using high-dimensional Voronoi triangulation but 
not without dramatic computational cost. 



3.3 Discussion of previous work on the HMA 



Th e HMA is treate d as a n expectation value in the literature. One of the failures pointed out 
by Meng and Wong (j 19961 ) and others is that the HMA is particularly awful when the sample is 
a single datum. In the context of the numerical arguments here, this is no surprise: one cannot 
accurately evaluate a quadrature with a single point! Even for larger samples, the HMA is formally 
unstable in t he limit of a thin-ta iled likelihood function owing to the divergence of the variance 
of the HMA. iRaftery et al.l (120071 ) address this failure of the statistic directly, proposing methods 
for stabilizing the harmonic mean estimator by reducing the parameter space to yield heavier- 
tailed densities. This is a good alt ernative to the analysis presented here when such stabilization is 
feasible. As previously mentioned, Wolpert ( 20021 ) presents conditions on the posterior distribution 
for the consistency of the HMA. Intuitively, there is a duality with the current approach. Trimming 
the Lebesgue quadrature sum so that the interval Yi + \ — Yi<h 1f is equivalent to lopping off the 
poorly sampled tail of the posterior distribution. This truncation will be one-sided in the estimate 
of the marginal likelihood Z since it removes some of the sample space. However, this may be 
compensated by an appropriate estimation of J. 



4 The new algorithms 

The exposition and development in $3]identifies the culprits in the failure of the HMA: (1) truncation 
error in the evaluation of the measure M(Y); and (2) the erroneous assumption that J = 1 when 
O s C f2 in practice. We now present two new algorithms, the Numerical Lebesgue Algorithm (NLA) 
and the Volume Tessellation Algorithm (VTA), that implement the strategies described in ^3.21 to 
diagnose and mitigate this error. NLA computes K and VTA computes J and, optionally, Z 
directly from equation ([2]). In the following sections, we assume that Del*. 
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4.1 Description 



We begin with a converged MCMC sample from the posterior distribution. After the initial sort in 
the values of Lj, the NLA computes the difference hj = Lj\ — LJ 1 with j = N, N — 1, . . . to find 
the first value of j = n satisfying hj < h*. The algorithm then computes the Mi for i = n, . . . , N 
using equation ([25)) . For completeness, we compute the Mj using both the restriction L(D\6) > Lj 
and L(D\6) > Li to obtain lower and upper estimate for M(Y). Then, these may be combined 
with Cs and Us from £}2]to Riemann-like upper and lower bounds on K . See the listing below for 
details. Excepting the sort, the work required to implement this algorithm is only slightly harder 
than the HMA. 

The VTA uses a kd-tree to partition the N samples from posterior distribution into a spatial 
regions. These tree algorithms split K. on planes perpendicular to one of the coordinate system axes. 
The implementation described here uses the median value along one of axes (a balanced kd-tree). 
This differs from general BSP trees, in which arbitrary splitting planes can be used. There are, 
no doubt, better choices for space partitioning such as Voronoi tessellation as previously discussed, 
but the kd-tree is fast, easy to implement, and published libraries for arbitrary dimensionality are 
available. Traditionally, every node of a kd-tree, from the root to the leaves, stores a point. In 
the implementation used here, the points are stored in leaf nodes only, although each splitting 
plane still goes through one of the points. This choice facilitates the computation of the volume 
spanned by the points for each node as follows. Let fhj be the number of parameter-space points 
0^ n \n = 1, . . . ,fhj in the j^ 1 node. Let fW denote the field quantities at each point 6^ n \ Some 
relevant field quantities include the values of the unnormalized posterior probability and the prior 
probability. The volume for Node j is 

k 

Vj = n [max(#! 1] , . . . , 9 [ p ] ) - mm(0? 1] , . . . , 9^)] . (35) 
i=l 

The set of nodes with fh = fhj = 2 q for some fixed integer q, determines an exclusive volume 
partition of the parameter space spanned by the point set, the frontier. The value of q is chosen 
large enough to limit the sampling bias of field quantities in the volume but small enough resolve the 
posterior modes of interest. The values q G [2, . . . , 6] seem to be good choices for many applications. 
Each node in the frontier is assigned a representative value f*. I use p-quantiles with p = 0.1, 0.5, 0.9 
for tests here. The resulting estimate of the integrals J and/or Z follow from summing the product 
of the frontier volumes with their values f*. 



4.2 Performance 

Both the NLA and the VTA begin with a sort of the likelihood sequence L and this scales as 
0(N log N). In the NLA, the computation of the followed by the computation of Z is 0{N). 
The sequence {M^} is useful also for diagnosis as we will explore in the next section. However, in 
many cases, we do not need the individual Mj but only need the differential value M% — Mi+i to 
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compute Z, which contains a single term. The values of likelihood may range over many orders of 
magnitude. Owing to the finite mantissa, the differential value be necessary to achieve adequate 
precision for large N, and the NLA may be modified accordingly. The algorithm computes the 
lower, upper, and trapezoid-rule sums (eqns. [9H12P for the final integral Z. For large posterior 
samples, e.g. N > 10000, the differences between C$ and Us are small. Indeed, a more useful error 
estimate may be obtained by a random partitioning and subsampling of the original sequence {L^} 
to estimate the distribution of Z (see the examples in £JS]). In practice, computing the marginal 
likelihood from a posterior sample with N = 400000 takes 0.2 CPU seconds on a single 2Ghz 
Opteron processor. Although NLA could be easily parallelized over n processors to reduce the 
total runtime by 1/n this seems unnecessary. 

The kd-tree construction in VTA scales as 0(kN log 2 N) followed by a tree walk to sum over 
differential node volumes to obtain the final integral estimates that scales as 0(N log N). This 
scaling was confirmed empirically using the multidimensional example described in ^5.31 with di- 
mension k G [1,40] and sample size N G [1000,10000000]. Computing the marginal likelihood 
from a posterior sample with N = 400000 and k = 10 takes 4.4 CPU seconds on a single 2Ghz 
Opteron processor, and, therefore, the computation is unlikely to be an analysis bottleneck, even 
when resampling to produce a variance estimate. The leading coefficient appears to vary quite 
weakly the distribution, although there may be undiscovered pathological cases that significantly 
degrade performance. The required value of N increases with parameter dimension k; N = 400000 
is barely sufficient for k = 40 in tests below. Subsampling recommends the use of even larger chains 
to mitigate dependence of the samples. Therefore, the first practical hardware limitation is likely 
to be sufficient RAM to keep the data in core. 



5 Tests &£ Examples 



To estimate the marginal likelihood using the methods of the previous section, we may use either the 
NLA to estimate K and the VTA to estimate J or use the VTA alone to estimate Z. Examples below 
explore the performance of these strategies. The M CMC posterior s imulations are all computed 
using the UMass Bayesian Inference Engine (BIE, IWeinberej . l2Q10h . a general-purpose parallel 
software platform for Bayesian computation. All e xample s excep t that in §5 .31 simulate the posterior 
distribution using the parallel tempering scheme (|Geverl . Il99ll ) with T = 128 and 20 temperature 



levels . Convergence was asses sed using the subsampling a lgorithm described in (jGiakoumatos et al. 
19991 ). a generalization of the Gelman and Rubin (|l992l ) test. 



5.1 Fidelity of the NLA and the VTA 

For a simple initial example, let us compute the marginal likelihood for a data sample D of 100 
points x ~ AA(0.5,0.03) modelled by A/"(6»,0.03) with prior distribution for 6 ~ W(-0.2,1.2). The 
marginal likelihood Z can be computed analytically from D for this simple example. The final 
200,000 converged states of the MCMC-generated chain were retained. Application of the NLA 
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Algorithm NL. Algorithm to compute the marginal likelihood from a posterior sample by 
Lebesgue integration. In Line 3, the value of h* can be chosen to trim anomalous values of 
likelihood. In Line 9, both choices for the inequality in 0, (see eq. [T^|) can be computed and 
stored simultaneously. Combined with Line 18, we can bracket the values by lower and upper sums 
(algorithmic error). 

Require: Likelihood values {Lj},j = 1, . . . , N from the simulated posterior distribution 
1: Sort the sequence so that {Lj > Lj—i} 

2: n = N lj find the smallest n obeying the threshold condition 

3: while L~\ - L^ 1 < h* do 

4: n <— n — 1 

5: end while 

6: for i = n to iV do 

7: Mi <— / / compute the measure, see eq. [25] 
8: for j = i to N do 

9: Mi^c- Mi + l/Lj and/or Mj <- Mi + 1/L j+1 
10: end for 
11: Save the values Mj 
12: end for 
13: for i = n to iV do 

14: Mj <— Mj/M^v // normalize the prior measure 
15: end for 

16: Z <— // compute the marginal likelihood, see eqns. [9H121 
17: for i = n to iV do 

{Lj lower sum "1 

Lj+i upper sum > 

(Lj + Lj + i)/2 trapezoidal rule J 

19: end for 

20: Save the estimated marginal likelihood, Z and algorithmic error 
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Algorithm VT. Algorithm to estimate J = f n d6 ir(6\M) over the domain Q s C Q sampled by 
the MCMC algorithm using the space partitioning kd-tree. 

Require: Likelihood values {Lj},j = 1, . . . , N from the simulated posterior distribution 
1: Change variables Yj = l/Lj 
2: Sort the sequence so that {Yj-\ < Yj} 
3: Create an empty point set V 

4: n «— 1 // find the largest n obeying the threshold condition 

5: while Y n+ i — Y n < /i* do 

6: n ^— n + 1 

7: Add (6»M,fW) to p 

8: end while 

9: Node root = BuildKD(P) 

10: Find the set of frontier nodes T with the desired number of points rfi 
11: 

12: for each node j G T do 

13: Compute the median value of f ^ among the fh points 
14: G G + V,- x median(f ) 
15: end for 

16: Save the estimated value of the integrals G 
Procedure: BuildKD('P) 

Require: A set posterior points and values (6^- n \{^) G V 
1: if "P contains only one member then 
2: return pointer to this leaf 
3: else 

4: Compute and store the range for each coordinate and the volume for this node 
5: Locate the coordinate dimension j € 1 ... k with maximum variance 
6: Determine the median value of 9j[i] (e.g. by Quicksort) 

7: Split V into two subsets by the hyperplane defined by median^): Vi e ft,V r i g ht- 
8: Node ie/t = BuildKD(P /e/t ) 
9: Node righ t = BuMKD(P ri g ht ) 
10: end if 

11: return pointer to the root node of the constructed tree 
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for K and the VTA for J gives a value of \ogZ = 31.15 ± 0.02 (95% confidence interval), close 
to but systematically smaller than the analytic result: \ogZ = 31.36. A value of h* = 0.05 seems 
appropriate from numerical considerations, although experiments suggest that the algorithm is not 
sensitive to this choice as long as /i* is not so small to decimate the sample or so large that error- 
prone outliers are included. It is prudent to check a range of to determine the appropriate value 
of each problem. The VTA yields logZ = 31.34 ± 0.01, consistent with the analytic result. The 
bias in the first estimate appears to be caused by an overestimate of J produced by the VTA. This 
might be improved by a space partition whose cells have smaller surface to volumes ratios ( §5.31 for 
a graphical example). The bias is much less pronounced in the direct estimate of Z by the VTA 
owing to smallness of the posterior probability in the extrema of the sample. These extrema result 
in anomalously small value of logZ = —289.8 for the HMA. 




(a) M(Y) (b) K(Y) - Y 

Figure 4: Details of the marginal likelihood computation illustrating the numerical Lebesgue ap- 
proach. Panel (a) compares the run of measure M function with Y = Lq/L computed from posterior 
simulation with N = 400000 elements using the NLA. Panel (b) shows the Lebesgue quadrature 
term, K(Y) — Yq from eq. [32j Ki ower j upper — Yq are the lower and upper Riemann sums. The sum 
K converges as long as M decreases faster than 1/Y. This illustrates the essence of the algorithm: 
anomalously small values of L degrade the fidelity of M at large Y = Lq/L but these same values of 
M make negligible contribution to K and therefore, may be truncated from the quadrature sums. 

Figure H] illustrates the details of the NLA applied to this computation. Panel (a) plots M from 
equation (|25p . The run of M with Y rises rapidly near the posterior mode and drops rapidly to 
zero for small likelihood values. The inset in this figure shows M in linear scale. The measure 
function M, and hence the integral K, is dominated by large values of L as long as M decreases 
sufficiently fast (see ^3. 1 1) . Panel (b) plots the accumulating sum defining the quadrature of Z 
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in equations (f9|)-(|12p. beginning with the largest values of likelihood first. The contribution to 
Z is dominated at the likelihood peak, corresponding to the steeply rising region of M in Panel 
(a). In other words, the samples with small values of L that degrade the HMA make a negligible 
contribution to the marginal likelihood computation as long as hi < /i*. In addition, NLA provides 
upper and lower bounds, and thereby some warning when the estimate is poorly conditioned, e.g. 
owing to an inappropriate choice for h*. The plot in Figure 0b will readily reveal such failures. 

A more realistic error assessment can be obtained by subsampling the sequence The CPU 

time for these algorithms is sufficiently small that this procedure should be practical in general. 
Consider the following experiment: (1) the posterior is simulated by MCMC (as described above) 
to obtain a chain of 400,000 states; (2) the first half of the chain is discarded; (3) the second- 
half is randomly subsampled with replacement to obtain 128 samples of 10,000 states; (4) the 
marginal likelihood for each is computed using the NLA, VTA, the Laplace approximation and 
the HMA (approximately 2 CPU minute in total). For all but the HMA, increasing the number 
of states decreases the variance for each distribution; samples with 10,000 states best revealed the 
differences between the algorithms with a single scale. 

Figure [5] illustrates the relative performance with different prior distributions. Figure [5^ is the 
model described at the beginning of this section; the range of the prior distribution is much larger 
than the values sampled from the posterior distribution. The prior distributions for each successive 
panel have smaller ranges as indicated. The colors are composited^ with a = 0.5 (e.g. HMA over 
VTA is brown, HMA over NLA is purple, Laplace over HMA is blue-grey, Laplace over VTA is 
blue-green). In Panel (d), the range is within the range of values sampled by the posterior in 
Panel (a). The overall trends are as follows: 1) the HMA has unacceptably large variance unless 
the domain of the prior roughly coincides with the domain sampled by the MCMC algorithm; 2) 
the VTA and Laplace approximation have the smallest variances, followed by HMA; 3) the NLA 
is consistently biased below the analytic value; and 4) the VTA and Laplace approximation are 
closed to the expected analytic value. Indeed, the Laplace approximation is an ideal match to and 
should do well for this simple unimodal model. In the final panel, there are no outlier values of L 
and the harmonic mean approximation is comparable to the others. These tests also demonstrate 
that the same outliers that wreck the HMA have much less affect on NLA and VTA. Further 
experimentation reveals that the results are very insensitive to the threshold value h*. In fact, one 
needs an absurdly large value of /t*, /t* > 1, to produce failure. 

5.2 Non-nested Linear Regression Models 

Here, we test these algorithms on the radiata pine compressive strength data analyzed by Han and 
Carlin (2001) and a number of previous authors. We use the data tabled by Han and Carlin from 
Williams (1959). These data describe the maximum compressive strength parallel to the grain j/j, 
the density Xi, and the resin-adjusted density z% for A = 42 specimens. Carlin and Chib (1995) 

1 For each color channel, value ci over co yields the new value c = (1 — q)co + aci. 
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A 32.6 



34.30 



log Z log Z 



(c) 0.4 < x < 0.6 (d) 0.46 < x < 0.54 

Figure 5: Histogrammed distributions of T for the NLA, VTA, HMA, and the Laplace approxima- 
tion for 10,000 randomly resampled states of a converged posterior distribution of 200,000 states. 
The dashed line shows the true value computed by directly integrating Z from eq. [2j Each panel 
is labeled by the range of the flat prior distribution for the position of the normal distribution. 
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use these data to compare the two linear regression models: 

M = l: yi = a + f3{x i -x)+e l , e; ~ A/"(0, ct 2 ), i = l,...,N 
M = 2: Vl = 7 + 8{zi - z) + e i: £i ~ A/"(0, r 2 ), i = l,...,N 

with M = {1,2}, 0i = {a,/3,a 2 } T , and 9 2 = {7,<5,r 2 } T . We follow Han and Carlin (2001) and 
Carlin and Chib (1995), adopting M ({3000, 185} T , Diag{10 6 , 10 4 }) priors on {a,/3} T and {7,<5} T , 
and IG (3, [2 * 300 2 ] -1 ) priors on a 2 and r 2 , where IG(a, b) is the inverse gamma distribution with 
density function 

e -i/(H 
^ V > = T(a)b a v a+1 

where v > and a, b > 0. Han and Carlin point out these priors are approximately centered on the 
least-squares solution but are otherwise rather vague. Using direct integration, Green and O'Hagan 
(1998) find a Bayes factor of about 4862 in favor of Model 2. 



Table 1: Marginal likelihood for non- nested linear regression models 



Model 


log Z{M = 1) 


log Z(M = 2) 


B21 


A% 


NLA 
VTA 
HMA 
Laplace 


-309.69 jJJJ 
-308.30 t° f 2 
-379.99™ 
-306.661JS 


-301.20l£g 
-299.83 +°;° 2 
-386.52 1 2 ^ 6 
-298.15lJ;°J 


4866 1^ 
4741 till 

u -0 
4974+^18 


0.1 
-2.5 
-100.0 
2.3 



Table Q] describes the results of applying the algorithms from previous sections to a converged 
MCMC chain of 2.4 million states for both models using the parallel tempering scheme. The quoted 
value is the median and the bounds are the p = 0.025 and p = 0.975 quantiles computed from 1024 
bootstrap samples of 100,000 states. I chose 100,000 state samples to achieve 95% confidence bounds 
of approximately 10% or smaller for both the NLA and VTA. The second and third columns of the 
table are the value of marginal likelihood for Models 1 and 2 for each of the four models listed in 
the first column. The quoted range is the 95% confidence bounds for each median value from the 
1024 samples. The fourth column is the Bayes factor for Model 2 to Model 1 and the fifth column 
is the relative difference from the exact result. The NLA, VTA and Laplace approximation yield 
values within a few percent of the true value. The VTA presents the smallest variance, followed 
by Laplace and then NLA. The HMA samples are too broadly distributed to be of use. Figure [6] 
shows the distribution of B21 for the samples; counter to the trend from §5-H both the VTA and 
Laplace approximation are more biased than the NLA here. 

The value h* used to compute the NLA will vary with the problem and the sample size. There- 
fore, some analysis of Z is required to choose an appropriate value. As an example, Figure [7] plots 
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the median and 95% confidence region for the bootstrap sampled marginal likelihood computation 
as a function of h* for the regression problem. The value of the VTA for the same truncated sample 
is shown for reference only; truncation is not needed for the VTA. The values for Z track each other 
closely for 0.001 < h* < 0.008. For /i* < 0.001, there are too few states for a reliable computation 
of Z. For /i* > 0.008, the NLA values are sensitive to the lowdikelihood tail, resulting in divergence 
with increasing /i*. 



0.007 



0.006 



0.005 - 




0,004 



0,003 - 





Figure 6: The histogrammed distribution of Bayes factors for the 1024 samples using the NLA, 
VTA and Laplace approximation. Although the variance for the NLA is larger than the VTA or 
Laplace approximation, its bias is small. 



5.3 High-dimension parameter spaces 

We adopt a 'data-free' likelihood function for parameter vector 6 with rank k: 

L(e) = {27ra 2 y k/2 e- 92 ^ 2 . 

with a 1 = constant. Further, we assume that each parameter 0j is normally distributed with a 
mean of and a variance of 1. The resulting expression for the marginal likelihood may be directly 
integrated, yielding P(a 2 , k) = [2n(l + a 2 )] k ^ 2 . 
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Model 1 



Model 2 




Figure 7: Comparison of the NLA and VTA as a function of h* for Models 1 and 2. The upper panel 
shows the run of Z for increasing h*; the lower panel shows the number of states out of 100000 that 
meet the h* threshold criterion. This increases to 100000 as h* increases; for a threshold = 0.06, 
approximately 100 states are rejected. The median (95% confidence region) is shown as a solid line 
(shaded band). The VTA 95% confidence region is nearly indistinguishable from the line width! 



For each m odel of dim e nsion k, we compute a Markov chain using the Differential Evolution 
algorithm (DE. iTer Braakl . boOfil ). This algorithm evolves an ensemble of chains with initial con- 
ditions sampled from the prior distribution. A proposal is computing by randomly selecting pairs 
of states from the ensemble and using a multiple of their difference; this automatically 'tunes' the 
prop osal width. We have further augmented this algor i thm b y including a tempered simulation 
step (|Neaj 119961 1 after every 20 DE steps (see I Weinberg] . |2010| . for more details^ . 

Each row describes of Table [2] describes the application of the NLA, VTA, and Laplace ap- 
proximation to a model of dimension k. The MCMC simulations produce approximately 1.4 mil- 
lion converged states. Convergence is testing using the Gelman- Rubin statistic (op. cit.). Each 
converged chain is resampled with replacement to provide 1024 subsamples of n states. The value 
N E [10000, 400000] is chosen to achieve 95% confidence intervals approximately 1% of Z or smaller. 
The 95% confidence intervals on Z are indicated as sub- and super-scripts. Recall that the stan- 
dard VTA determines volume spanned fh samples and approximates the integral by multiplying 
the volume by the median value of the sample. To assess the variance inherent in this choice, I 
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Table 2: Test of high-dimensional marginal likelihood 



Model 



NLA 



VTA 



Laplace 



Exact 



logZ 



A% 



log Z . 



log Zqj 



logZ, 



0.9 



A% 



logZ 



A% 



-1.468 
-2.936 
-7.341 



10 -14.68 
20 -29.36 
40 -58.73 



-1.45 



+0.01 
-0.01 



_ 2 04+0.02 

z ' y ^-0.04 



-7.31 
-14.47 
-29.23 
-59.51 



+0.01 
-0.01 

+0.01 
-2.79 

+0.01 
-0.01 

+0.23 
-0.14 



0.7 

0.5 
0.4 
1.4 
0.4 
1.3 



-1.45 

-2.92 

-6.90 

-14.56 

-29.38 

-59.69 



-1.45 
-2.92 
-7.35 



+0.01 
-0.01 

+0.01 
-0.01 

+0.01 
-0.01 



-14 44+0-01 

-29.141JS 
-59.01 S» 



-1.45 

-2.92 

-7.48 

-14.34 

-28.91 

-58.97 



0.7 
0.5 
0.1 
1.6 
0.7 
0.9 



-1.60 
-3.20 
-7.99 
-16.06 
-32.38 
-56.59 



+0.01 
-0.01 

+0.01 
-0.01 

+0.01 
-0.01 

+0.04 
-0.03 

+0.08 
-0.07 

+0.01 
-0.01 



9.0 

9.0 

8.7 

9.4 

10 

8.1 



quote the results for two other p-quantiles, p = 0.1 and p = 0.9. Finally, for each algorithm, the 
table presents the relative error: A% = |logZ — logZ|/| logZ| x 100. 

Both the NLA and VTA results are very encouraging: the relative error is within a few percent 
for 1 < k < 40. For k = 40, I computed Z with samples sizes of 400,000 states. Both the 
NLA and VTA tend to slightly overestimate Z for large k. The Laplace approximation results are 
disappointing for small k and improve for large k, but still are less precise than either the NLA or 
VTA. 

Figure [8] illustrates the kd-tree construction for a single k = 2 sample. Each two-dimensional 
cell is colored by the median value of the posterior probability for the fh = 32 points in each cell 
and scaled to the peak value of posterior probability P for the entire sample. A careful by-eye 
examination of the cell shape reveals a preponderance of large axis-ratio rectangles; this is a well- 
known artifact of the kd-tree algorithm. For large values of P, the volume elements are small, and 
with a sufficiently large sample, the gradient in P across the volume are small. For small values 
of P, the volume elements are large, the gradients are large, and the large-axis ratio rectangles 
distort the reconstructed shape of the true posterior. However, as described in §3.21 the values of 
ir(6)L(D\6)AV = constant for an infinite sample, so a small number of distorted rectangles will not 
compromise the end result. Moreover, the values of ir(0)L(D\0)AV at large volumes are smaller 
than those at small volume for these tests, and this further decreases the importance of the kd-tree 
cell-shape artifact. 
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n 1 

-- 0.8 

-- 0.6 

-- 0.4 

-- 0.2 




-2-1012 

x 1 

Figure 8: Two-dimensional illustration of the domain decomposition for the Gaussian likelihood 
example described in §5.31 The cells are colored according to posterior probability on a linear scale 
from to sup{P}. 



5.4 Model selection 



As an example of model selection, we first compute the marginal likelihood for the same data 
x ~ A/"(0.5, 0.03) as in the first example of £ j5.1l but assuming a Cauchy-Lorentz distribution, 



C(a,b) : P(x\a,b) 



■Kb 1 + 



(x — a) 



b 2 



-i 



as the model with unknown location a and scale b parameters. For prior distributions, we take 
a ~ Z//(0, 1) and b ~ W(0.025, 1) where W(A, k) is the Weibull distribution with scale parameter A 



and shape parameter k. NLA yields logZ = 5.10^qqo, VTA yields logZ 



5.12+°;°2 and the HMA 



yields log Z = 6.62_ 1 ^ L9 . The data and fits are shown in Figure[9h. There should be no surprise that 
the true model (with log Z = 33.5)is strongly preferred. Let us now repeat the experiment using 100 
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X 



(a) x ~ A/"(0.5, 0.03) (b) x ~ Cauchy(0.5, V / 0T03) 

Figure 9: A histogrammed distribution of the 1000 data points from M (0.5, 0.03) Panel (a) and 
100 data points from C(0.5, V0.03) Panel (b) used in these examples compared with the "best fit" 
Normal and Cauchy distributions chosen from the peak of the posterior distribution. 




-75.0 

N 



-76.0 




Figure 10: Box and whisker plot for the distribution of the \ogZ but for a sample from the Cauchy 
distribution from Fig. [9) The box shows the quantiles and median, the whisker shows the (10%, 
90%) intervals, followed by outlying points. The three distributions are (1) the HMA; (2) NLA for 
K and VTA for J; and (3) VTA for both J and Z. 
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data points selected from the Cauchy-Lorentz distribution {Z\) and compare the marginal likelihood 
values for a Cauchy-Lorentz distribution and a mixture of two Normal distributions {Z2). NLA and 
VTA, respectively yield \ogZ x = -76.31^,-76.1^;^ and logZ 2 = -136.9+^,-134.4+^. The 
HMA yields logZi = — 75.9+q'qJ and logZ 2 = — 116.5+Qg. Regardless of the algorithm performing 
the test, the Bayes factor reveals strong evidence in favor of the true model. Note from Figure [9]b 
that both models are reasonable fits "by eye". However, the Bayes factor overwhelmingly prefers 
the simpler (in this case, true) model. As expected, the distribution of Z for the heavy-tailed 
Cauchy distribution is much better behaved (see Fig. [10|h The results for NLA and VTA are 
consistent and the HMA is systematically larger, but non enough to misguide a decision. 



6 Discussion and Summary 

In summary, much of the general measure-theoretic underpinning of probability and statistics nat- 
urally leads naturally to the evaluation of expect ation values. For example, the harmonic mean 
approximation (HMA, iNewton and Rafteryi . fl99l l for the marginal likelihood has large variance 



and is slow to converge (e.g. IWolpertl . 120020 . On the other hand, the use of analytic density func- 
tions for the likelihood and prior permits us to take advantage of less general but possibly more 
powerful computational techniques. In §§[THHwe diagnose the numerical origin of the insufficiencies 
of the HMA using Lebesgue integrals. There are two culprits: 1) the integral on the left-hand side 
of equation ([3]) may diverge if the measure function M(Y = L^ 1 ) from equation (124p decreases too 
slowly; and 2) truncation error may dominate the quadrature of the left-hand side of equation ([3]) 
unless the sample is appropriately truncated. Using numerical quadrature for the marginal likeli- 
hood integral (eqns. [2] and [24")) leads to improved algorithms: the Numerical Lebesgue Algorithm 
(NLA) and the Volume Tessellation Algorithm (VTA). Our proposed algorithms are a bit more 
difficult to implement and have higher computational complexity than the simple HMA, but the 
overall CPU time is rather modest compared to the computational investment required to produce 
the MCMC-sampled posterior distribution itself. For a sample of size N, the sorting required by 
NLA and VTA has computational complexity of 0(N log N) and 0(N log 2 N), respectively, rather 
than 0{N) for the harmonic mean. Nonetheless, the computational time is a fraction of second to 
minutes for typical values of 10 5 < N < 10 8 (see $3J). 

The geometric picture behind NLA is exactly that for Lebesgue integration. Consider integrat- 
ing a function over a two-dimensional domain. In standard Riemann quadrature, one chops the 
domain into rectangles and adds up their area. The sum can be refined by subdividing the rect- 
angles; in the limit of infinitesimal area, the resulting sum is the desired integral. In the Lebesgue 
approach, one draws horizontal slices through the surface and adds up the area of the horizontal 
rectangles formed from the width of the slice and the vertical distance between slices. The sum can 
be refined by making the slices thinner when needed; in the limit of slices of infinitesimal height, 
the resulting sum is the desired integral. In the Riemann case, we multiply the box area in the 
domain, dA, by the function height, /. In the Lebesgue, we multiply the slice height in the range, 
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df, by the domain area, A (see Fig. [I]). Both algorithms easily generalize to higher dimension. 
For the Lebesgue integral, the slices become level sets on the hypersurface implied by the inte- 
grand. Therefore the Lebesgue approach always looks one-dimensional in the level-set value; the 
dimensionality k is 'hidden' in the area of domain (hypervolume A for k > 3) computed by the 
measure function M(Y). The level-set value for the NLA is Y = 1/L(D|0). Once determined, 
NLA applies the trapezoidal rule to the sum over slices and compute the upper and lower rectangle 
sums as bounds. Clearly, the error control on this algorithm might be improved by using more of 
the information about the run of A with /. 

Having realized that the practical failure of the harmonic mean approximation is a consequence 
of the sparsely sampled parameter-space domain, NLA addresses the problem by determining a 
well-sampled subset S] s cU from the MCMC sample, ex post facto. Restricted to this subset, £l s , 
the value of the integral J on the right-hand side of equation ([3]) is less than unity. We determine 
Q s by a binary space partitioning (BSP) tree and compute J from this partition. A BSP tree 
recursively partitions a the k-dimensional param eter space into convex subspaces. The VTA is 
implemented with a kd-tree (jCormen et all boOlh for simplicity. In addition, one may use VTA by 
itself to compute equation ([2]) directly. 

Judged by bias and variance, the test examples do not suggest a strong preference for either 
the NLA or the VTA. However, both are clearly better than the HMA or the Laplace approxi- 
mation. Conversely, because these algorithms exploit the additional structure implied by smooth, 
well-behaved likelihood and prior distribution functions, the algorithms developed here will be in- 
accurately and possibly fail miserably for wild density functions. The NLA and the VTA are not 
completely independent since the NLA uses the tessellation from the VTA to estimate the integral 
J. However, the value of the integral K tends to dominate Z, that is | log S> | log J\, and the 
contributions are easily checked. Based on current results, I tentatively recommend relying prefer- 
entially on VTA for the following reasons: 1) there is no intrinsic divergence; 2) it appears to do 
as well as VTA even in a high-dimensional space; and 3) there is no truncation threshold h*. 

Figure [8] illustrates the potential for volume artifacts that could lead to both bias and variance. 
This error source affects both the VTA and NLA (through the computation of J) but the affect 
on the NLA may be larger ( § 5 . 1 H . Additional real- world testing, especially on high-dimensional 
multimodal posteriors, will provide more insight. In test problems described in this paper, I explored 
the effects of varying the threshold h* and the kd-tree bucket size m. These parameters interact 
the sample distribution, and therefore, are likely to vary for each problem. I also recommend 
implementing both the NLA, VTA, HTM, Laplace approximation and comparing the four for 
each problem. We are currently testing these algorithms for astronomical inference problems too 
complex for a simple example; the results will be reported in future papers. An implementation 
of the se algorithms w ill be provided in the next release of the UMass Bayesian Inference Engine 
(BIE. IWeinberj . l20ld ). 

There are several natural algorithmic extensions and improvements not explored here. $2] de- 
scribes a smoothed approximation to the computation of M(Y) (eqns. [T3TI16P rather than the step 
function used in £JJ1 The direct integration of equation ([2j) currently ignores the location of field 
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values in each cell volume. At the expense of CPU time, the accuracy might be improved by fitting 
the sampled points with low-order multinomials and using the fits to derive a cubature algorithm 
for each cell. In addition, a more sophisticated tree structure may decrease the potential for bias 
by providing a tessellation with "rounder" cells. 
In conclusion, the marginal likelihood 



may be reliably computed from a Monte Carlo posterior sample though careful attention to the 
numerics. We have demonstrated that the error in the HMA is due to samples with very low 
likelihood values but significant prior probability. It follows that their posterior probability also 
very low, and these states tend to be outliers. On the other hand, the converged posterior sample is 
a good representation of the posterior probability density by construction. The proposed algorithms 
define the subdomain £l s C f2 dominated by and well-sampled by the posterior distribution and 
perform the integrals in equation ([3]) over Q s rather than f2. Although more testing is needed, 
these new algorithms promise more reliable estimates for Z from an MCMC simulated posterior 
distribution with more general models than previous algorithms can deliver. 
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